ABSTRACT


With the emergence of MOOCs, it becomes crucial to automate the 
process of a course design to accommodate the diverse learning 
demands of students. Modeling the relationships among educational 
topics is a fundamental first step for automating curriculum planning 
and course design. In this paper, we introduce Topic Transition Map 
(TTM), a general structure that models the content of MOOCs at 
the topic level. TTMs capture the various ways instructors organize 
topics in their courses by modeling the transitions between topics. We 
investigate and analyze four different methods that can be exploited 
to learn the Topic Transition Map: 1) Pairwise Constrained K-Means, 
2) Mixture of Unigram Language Model, 3) Hidden Markov Mixture 
Model, and 4) Structural Topic Model. To evaluate the effectiveness
of these methods, we qualitatively compare the topic transition maps 
generated by each model and investigate how the Topic Transition 
Map can be used in three sequencing tasks: 1) determining the 
correct sequence, 2) predicting the next lecture, and 3) predicting the 
sequence of lectures. Our evaluation revealed that PCK-Means has 
the highest performance in the first task, HMMULM outperforms 
other methods in task 2, while there is no clear winner in task 3.


Keywords 
Topic Transition Map, Topic Transition, Word Distribution, Mixture 
Model, Hidden Markov Model, Clusters, Sequencing Tasks. 


1. INTRODUCTION 


For many decades, the process of creating courses has been a manual
task that needs to be carefully managed by instructors and experts.
However, with the recent advances in technologies and the emergence 
of Massive Open Online Courses (MOOCs), it becomes critical 
to automate the process of course design to accommodate the 
heterogeneity of online students and their diverse needs. According 
to [32], learning on demand is considered one factor that causes 
the high dropout rate in MOOCs. Learners have different learning
demands depending on their motivations and goals. For instance, 
learners may seek knowledge about an interdisciplinary domain 
and hence need to learn modules from courses in several areas. 
This problem requires adopting a model in which MOOCs are used 
as modularized resources, rather than a set of pre-designed static 
courses. A crucial first step toward developing such a model is the 
automation of course plan design by sequencing lectures among 
different courses. 


The main principle in designing the curriculum of any course is to 
organize course content according to some relations between topics. 
For instance, to help students to learn the materials, instructors 
carefully organize lectures as a sequence, based on the difficulty 
levels of topics [10, 27, 1] as well as the dependency relations 
between topics [11, 21, 23, 1]. The fundamental sequential structure 
of a course design is to place topics that are easy or prerequisite 
in earlier lectures while more advanced and dependent topics are 
taught in later lectures [1]. Consequently, modeling the relatedness 
among educational topics is a very crucial first step for automating 
curriculum planning and course design. 


Modeling the content structure of MOOCs has recently attracted 
much research. Most of the current research has focused on modeling 
the prerequisite relationships between courses [29, 15], between lec- 
tures or segments of lectures [6, 7], or between concepts discussed 
within or across courses [3, 14, 17, 29, 15]. Using concepts to model 
MOOCs’ content can be easily generalized to capture the relations 
in the concept space. However, because concepts are represented 
as keywords or phrases, it is hard to capture the different levels of 
granularity between lectures and courses. In addition, modeling pre- 
requisite relationships between concepts cannot capture the various 
learning paths accommodated by different courses. 


In this paper, we introduce the Topic Transition Map, a general 
structure that models the educational materials at the topic level. 
We model a course as a set of topics, and each topic is a set of 
concepts. Modeling content at the topic level is a more natural way 
to design custom course plans. We can think of a course as a path in 
the generalized Topic Transition Map. Thus, designing a new course 
becomes a task of identifying a path in the Topic Transition Map. 
Additionally, we investigate four methods that can be leveraged to 
construct the Topic Transition Map: Pairwise Constrained K-Means 
(PCK-Means) [2], Mixture of Unigram Language Model (MULM), 
Hidden Markov Mixture Model (HMMULM), and Structural Topic 
Model (strTM) [24]. We analyze and compare the Topic Transition
Maps learned by these methods by studying how to exploit the 
Topic Transition Map in three sequencing tasks: 1) determining the 
correct sequence, 2) predicting the next lecture, and 3) predicting 
the sequence of lectures. To the best of our knowledge, this is the
first work to introduce and investigate the use of Topic Transition
Maps in modeling MOOC content and sequencing lectures.


To evaluate the effectiveness of all methods, we use real MOOCs from 
three different domains: Python, Structured Query Language, and
Machine Learning Clustering algorithms. Our evaluation revealed 
that while the PCK-Means has the highest performance on the task 
of finding the best sequence from a list of possible sequences, the 
HMMULM achieves the best performance on the task that predicts 
the next lecture in the sequence. Additionally, all methods perform
similarly in the task of predicting the whole sequence, with MULM
having the lowest performance, as it sometimes cannot predict the whole
sequence. In addition to comparing various models in sequencing
tasks, we visualize the Topic Transition Maps generated by different
methods to qualitatively compare the resulting topic transition maps.
We found that PCK-Means has extracted more meaningful topics
with the best word distributions that clearly explain each topic.


The rest of the paper is organized as follows. In section 2, we present
some of the related work. Section 3 defines the topics and topic
transitions and states some applications of the topic transition maps.
In section 4, we formally define our problem before describing the 
four different methods we exploit to construct the topic transition 
maps in section 5. Section 6 elaborates on our approach for the 
evaluation and the analysis of various models. Finally, we conclude 
our work in section 7. 


2. RELATED WORK 

Most of the work that models the content of MOOCs has focused 
on capturing the prerequisite relationships using different levels of 
granularity such as courses [29, 15], lectures or segments of lectures 
[6, 7], or concepts discussed within or across courses [3, 14, 17, 29, 
15]. Modeling the relations between courses, lectures, or segments of
lectures is restricted to these units and cannot be generalized. While 
modeling dependency relations between concepts is considered a 
general structure that captures the required concepts before learning 
any concept, prerequisite relations cannot model the various learning 
paths accommodated by different courses. ALSaad and Alawini [2] 
have addressed this problem by proposing the precedence graph 
that captures the similarities and variations of learning paths among 
different courses. We build on their work and introduce the Topic 
Transition Map that maps each lecture to a topic and leverages the 
sequences of lectures among courses to capture the topic transitions 
pattern and hence the likelihood of such a transition. The main 
difference between the Topic Transition Map and the precedence 
graph is that the Topic Transition Map models self transitions between a
topic and itself and also captures how likely each topic is to be the first
topic in courses. While ALSaad and Alawini [2] have investigated the
use of PCK-Means in modeling the precedence graph, in this paper, 
we explore three more methods in addition to PCK-Means, namely
MULM, HMMULM, and strTM, for modeling Topic Transition 
Maps. We also examine the impact of the learned topic transition 
maps on three different sequencing tasks. We believe that this is the
first work that examines the use of topic transitions modeled from
existing MOOCs to learn how to sequence new courses.


Some research has investigated the use of prerequisite relations
between concepts to construct and sequence learning units [1, 16].
Both studies [1, 16] have developed supervised approaches based 
on feature engineering that extracted features from some external 
knowledge such as Wikipedia [1] and DBpedia [16] to infer the 
prerequisite relations between concepts. Our work is different as 
instead of modeling the prerequisite relations between concepts 
using supervised approaches, we model the Topic Transition Map or 
the various paths between topics using unsupervised methods, where 
a topic is a set of concepts. In addition, our methods rely only on the 
content of MOOCs without using any external knowledge. While 
Agrawal et al. [1] used the concept dependency graph to organize 
concepts to construct learning units and then sequence the learning 
units, we use lectures from existing MOOCs and investigate the 
impact of the learned transitions between topics to sequence lectures. 


The most relevant research to our study is the work by Shen et al. [22]. 
Shen et al. [22] have proposed a method for linking similar courses 
to construct a map of lectures connected by two types of relations: 
similar and prerequisite. The constructed map only captures the 
similarity and prerequisite relations between certain units (lectures) 
and is not generalized to other lectures and thus cannot be used to 
predict the sequence of new lectures. In this paper, we map lectures 
to topics and construct the Topic Transition Map that depicts the 
precedence relations between topics and hence is not tied to any
specific units. Having a generalized Topic Transition Map can help
in finding the sequence of lectures or predicting the next lecture in the
sequence, as we discuss in section 6.2.


Another related line of research is the work on structural topic 
modeling by the Natural Language Processing (NLP) community.
In NLP, topic transitions have been used to model latent topical 
structures inside documents by assuming each sentence is generated 
from a topic where topics satisfy the first order Markov property 
[12, 25]. While Gruber et al. [12] only modeled the transition between
topics as a binary relation (either remain on the current topic or 
shift to a new topic with a certain probability), Wang et al. [25] 
have developed a Structural Topic Model called strTM to explicitly 
model the topic transitions as probabilities that capture how likely 
one transits from a topic to another. Modeling transitions has been
used in many applications related to NLP such as sentence ordering
[25], topic segmentation [9], and multi-document summarization
[28]. In this paper, we investigate the use of topic transitions for
modeling the topical structures in MOOCs by assuming a lecture
is generated by one topic and using the sequences of lectures to learn
the transitions between topics. We also explore the impact of using 
the Topic Transition Map to sequence lectures in three different 
sequencing tasks. 


3. TOPIC TRANSITIONS 


Before defining the topic transitions, it is important to briefly explain 
our representation of topics used in this paper. Similar to the definition 
of topics in the literature of the topic modeling research [5, 13], 
we define a topic as a distribution of concepts where concepts 
with higher probabilities tend to explain or characterize the topic. 
Concepts can be represented as words or phrases of words [3, 18, 26]. 
Each lecture is a composition of concepts and hence can be mapped 
to some topics. Depending on the length of lectures, lectures can 
cover one or more topics. Longer lectures usually cover more topics 
than shorter lectures. For example, traditional university lectures
tend to be more elaborate and longer than MOOC
lectures, which are usually concise and short. Therefore,
the number of topics per lecture in MOOCs is smaller than
that of traditional university lectures. In this paper, since our work
focuses on learning the topic transitions from MOOCs, we assume
that each lecture is mapped to only one topic. This assumption is
reasonable as lectures in MOOCs are concise and short in length.
This assumption is also useful, as it helps in leveraging
the sequences of lectures to learn the relations between topics.


A topic transition captures the precedence relations between topics.
In other words, it indicates how likely instructors are to move or transit from
one topic to another in the course delivery. It models the various
ways in which instructors dynamically assemble concepts from the
concept space in order to construct the study plan of their courses.
For instance, some instructors decide to start their Python course by 
explaining the topics: data types, conditional statements, loops, and 
then reading and writing from files. Other instructors may choose a 
different order, such as conditional statements, loops, string, and then 
lists. By leveraging the sequences of lectures from multiple courses, 
we can infer the latent topics of each lecture and hence model the 
common transition patterns shared by multiple courses as well as the
variations of different transitions or paths. To determine the strength
of a transition, or how common it is, each transition is attached with a
score or a probability. For example, consider the following topics in Computer
Science Programming: “Conditional Statements”, “Loops”, and
“Arrays”. It is more likely that instructors will explain the topic
“Conditional Statements” immediately before the topic “Loops” and 
thus the topic transition score between them would be higher than 
the transition score between the topic “Conditional Statements” and 
the topic “Arrays”. 


Learning the topic transitions can be the initial block to facilitate 
several useful applications that can support modern learning. For 
instance, we can use the Topic Transition Map to extract the most 
common paths of topics in the field or explain the topic space in 
the current MOOC offerings. Learners can use transition maps to 
get more insights about the structure of topics in MOOC offerings. 
On the other hand, instructors can use these maps to improve their 
course offerings by examining the topic structure of related courses. 


One important application of the Topic Transition Map is to support 
automatic curriculum planning and course design. Since courses 
consist of topics, learning the relations between topics would be the 
initial step to understand how likely instructors transit from one topic 
to another. We can think of a course as a path in the generalized 
Topic Transition Map. Thus, designing a new course becomes a task 
of identifying a path in the Topic Transition Map. In this paper, we 
analyze how we can use the learned topic transitions to sequence
new courses. 


4. PROBLEM FORMULATION 


In this section, we formally formulate our problem. We are given a set of
courses $C = \{X_1, X_2, X_3, \ldots, X_N\}$ from a particular domain,
where $N$ is the total number of courses. We assume that courses in
$C$ are similar and hence have some content overlaps between them
and also have the same difficulty level (e.g., Beginner, Medium, or
Advanced). A course $X_i$ is represented as an ordered list of lectures
$X_i = (x_{i1}, x_{i2}, \ldots, x_{i|X_i|})$, where $|X_i|$ is the total number of
lectures in the course $X_i$. Each lecture is a composition of concepts
represented in some narrative way. In this paper, we treat a
concept as a single word and hence lectures are represented using a
bag-of-words representation.


Given the number of topics $M$, our goal is to map each lecture to a
topic and leverage the sequences of lectures to learn topic transitions
and construct the Topic Transition Map. The Topic Transition Map
is represented as a matrix $A$, where $A \in \mathbb{R}^{M \times M}$. Each entry $a_{ij}$ of
the matrix $A$ represents the likelihood of the transition from topic $i$
to topic $j$. It reflects how common the precedence relation from topic
$i$ to topic $j$ is in the dataset courses. In addition to the Topic Transition Map,
we also aim to learn the probability of each topic being an initial
topic in courses. We denote the initial probability of each topic as
a vector $\pi$, where $\pi \in \mathbb{R}^{M}$. Along with the Topic Transition Map
and the initial probability of each topic, it is important to model
the word distribution of each topic, which is represented as a matrix
$B \in \mathbb{R}^{M \times V}$, where $V$ is the vocabulary size.
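

As a minimal sketch of how these quantities fit together, the Python snippet below stores the initial vector, the Topic Transition Map, and the word distributions as NumPy arrays and scores a sequence of topics as a path through the map. The toy values, the assumption that each row of the transition matrix is normalized into probabilities, and the helper name `path_log_score` are illustrative choices and are not taken from the paper.

```python
import numpy as np

# Toy parameters for M = 3 topics and V = 4 words (illustrative values only).
pi = np.array([0.7, 0.2, 0.1])             # initial probability of each topic
A = np.array([[0.1, 0.8, 0.1],             # A[i, j]: transition from topic i to topic j
              [0.1, 0.2, 0.7],
              [0.2, 0.2, 0.6]])
B = np.array([[0.5, 0.3, 0.1, 0.1],        # B[i, w]: probability of word w under topic i
              [0.1, 0.5, 0.3, 0.1],
              [0.1, 0.1, 0.3, 0.5]])


def path_log_score(topic_path, pi, A):
    """Log-score of a non-empty topic sequence viewed as a path in the map.

    Assumes pi and the rows of A are probability distributions.
    """
    score = np.log(pi[topic_path[0]])
    for prev_topic, next_topic in zip(topic_path, topic_path[1:]):
        score += np.log(A[prev_topic, next_topic])
    return float(score)


# A path that follows the common transitions scores higher than its reverse.
print(path_log_score([0, 1, 2], pi, A), path_log_score([2, 1, 0], pi, A))
```

Under this reading, designing a course amounts to searching for a high-scoring path, which is why the three matrices are kept as separate, explicitly shaped arrays.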


5. MODELING TOPIC TRANSITIONS 


In this section, we explain the four different models we exploit to 
capture topics and Topic Transition Maps. 


5.1 Pairwise Constrained K-Means 

PCK-Means clustering algorithm [4] is a variation of the standard 
K-Means algorithm. To cluster instances, PCK-Means incorporates 
distance between points as well as pairwise constraints to guide the 
clustering process. Since the purpose of clustering is to capture topic 
transition patterns across courses, using PCK-Means helps to restrict 
the clustering process to cluster lectures across courses instead 
of within courses [2]. To guide the clustering, PCK-Means uses 
two types of constraints: Must-Link and Cannot-Link. Must-Link 
constraint determines lecture pairs that need to be clustered together, 
while Cannot-Link constraint specifies pairs that should not be 
grouped into the same cluster. To find the clusters, PCK-Means uses 
an objective function that minimizes both: 1) the distance between 
points (lectures) and the cluster centroid, and 2) the penalty costs of 
violating the constraints. For more information about PCK-Means,
please refer to [4]. 


Similar to ALSaad and Alawini [2], we use PCK-Means to build the 
Topic Transition Map A. We first construct the list of Must-Link and
Cannot-Link constraints to cluster lectures based on their content
similarity. We assume that each cluster forms a topic
and hence we need to learn the word distributions of each topic 
along with topic transitions. We link clusters by using the precedence 
relations between adjacent lectures and capture the strength of the 
transition by accumulating the frequency of transitions. To find 
the word distribution of each cluster or topic in the matrix B, we 
accumulate the vector representations of each lecture that belongs to 
the same cluster. For more information, please see [2]. 


In order to estimate the initial probability $\pi$ for each topic, we simply
count the number of times each topic is the first topic in the
set of courses $C$. Then we normalize the counts to obtain the probability.
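

The counting step described above can be sketched in a few lines of Python. The sketch assumes that each course has already been converted into a list of cluster (topic) indices, one per lecture. The function name `estimate_transitions` and the choice to normalize each row of the transition counts into probabilities are assumptions made for illustration; the text only states that transition frequencies are accumulated.

```python
import numpy as np


def estimate_transitions(courses, num_topics):
    """Count-based estimate of the transition map and initial probabilities.

    courses: list of courses, each a list of topic indices (one per lecture).
    Returns (A, pi), where row i of A is normalized over the next topic.
    """
    counts = np.zeros((num_topics, num_topics))
    initial = np.zeros(num_topics)
    for topics in courses:
        if not topics:
            continue  # skip empty courses
        initial[topics[0]] += 1
        # Every pair of adjacent lectures contributes one observed transition.
        for prev_topic, next_topic in zip(topics, topics[1:]):
            counts[prev_topic, next_topic] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows without any outgoing transition are left as zeros.
    A = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
    total = initial.sum()
    pi = initial / total if total > 0 else initial
    return A, pi


A_hat, pi_hat = estimate_transitions([[0, 1, 2], [0, 2, 2], [1, 2]], num_topics=3)
print(A_hat)
print(pi_hat)
```

Note that a repeated topic in adjacent lectures, such as the pair (2, 2) above, is counted as a self transition.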


5.2 Mixture of Unigram Language Models 

To capture topics, we use a mixture model of $M$ unigram language
models (MULM) with a bag-of-words representation. The mixture
model is a generative probabilistic model that has been used for
document clustering. Thus, it will help in clustering lectures based
on their topics, where each lecture belongs to only one cluster or one
topic. In the mixture model, to generate a document, one first needs
to choose the topic of the document according to the probability
$P(\theta_i)$, where $i \in \{1, \ldots, M\}$ and $M$ is the number of topics, and then generate all the
words in the document using the probability $P(w \mid \theta_i)$. According
to the model, the likelihoods of a document $x$ and the corpus $C$ are
calculated as follows:


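$$P(x \mid \lambda) = \sum_{i=1}^{M} P(\theta_i) \prod_{w \in V} P(w \mid \theta_i)^{c(w, x)} \tag{1}$$


$$P(C \mid \lambda) = \prod_{x_i \in C} P(x_i \mid \lambda) \tag{2}$$


where $c(w, x)$ is the number of times word $w$ occurs in $x$.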


To estimate the model parameters $\lambda = (\{\theta_i\}, P(\theta_i))$, where $\theta_i$ is the
word distribution of topic $i$ from the matrix $B$, we use the Expectation-Maximization
algorithm [8] to find the parameters that maximize
the likelihood of the data:


$$\hat{\lambda} = \arg\max_{\lambda} P(C \mid \lambda) \tag{3}$$


After learning the parameters, we map each lecture to a cluster or 
topic by maximizing the following equation: 


$$c = \arg\max_{z_i} P(x \mid z_i) \tag{4}$$


Using the mixture model can help in clustering lectures according
to their topics; however, it will not capture the transition patterns
between topics or the initial probability of each topic. Therefore,
similar to the PCK-Means method, we leverage the sequences of
lectures to calculate the score of topic transitions and construct the
Topic Transition Map $A$. Likewise, we count the number of times
courses start with each topic and normalize the results to model the
initial probability $\pi$.
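

To make the estimation procedure concrete, the following Python sketch runs Expectation-Maximization for a mixture of unigram language models on a matrix of word counts and then assigns each lecture to its most likely topic, in the spirit of Equation 4. It is a simplified illustration and not the implementation used in the paper: the names `fit_mulm` and `assign_topics`, the additive `smoothing` constant, the optional `init_theta` argument, and the fixed number of iterations are all assumptions.

```python
import numpy as np


def fit_mulm(X, num_topics, n_iter=50, init_theta=None, smoothing=1e-2, seed=0):
    """EM for a mixture of unigram language models.

    X: (num_lectures, vocab_size) array of word counts.
    init_theta: optional (num_topics, vocab_size) starting word distributions
                with strictly positive entries; random if omitted.
    Returns (prior, theta): topic probabilities and per-topic word distributions.
    """
    X = np.asarray(X, dtype=float)
    n_docs, vocab_size = X.shape
    rng = np.random.default_rng(seed)
    prior = np.full(num_topics, 1.0 / num_topics)
    if init_theta is None:
        theta = rng.dirichlet(np.ones(vocab_size), size=num_topics)
    else:
        theta = np.asarray(init_theta, dtype=float)
    for _ in range(n_iter):
        # E-step: posterior of each topic for each lecture, computed in log space.
        log_joint = np.log(prior) + X @ np.log(theta).T
        log_joint -= log_joint.max(axis=1, keepdims=True)
        resp = np.exp(log_joint)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters (smoothing is an added assumption
        # that keeps every probability strictly positive).
        prior = (resp.sum(axis=0) + smoothing) / (n_docs + num_topics * smoothing)
        theta = resp.T @ X + smoothing
        theta /= theta.sum(axis=1, keepdims=True)
    return prior, theta


def assign_topics(X, theta):
    """Map each lecture to the topic whose word distribution best explains it."""
    return np.argmax(np.asarray(X, dtype=float) @ np.log(theta).T, axis=1)
```

Once every lecture has a topic label, the same adjacent-lecture counting used for PCK-Means yields the transition map and the initial probabilities.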


5.3. Hidden Markov Mixture Models 


Instead of separately clustering lectures and then learning the transi- 
tions between them, using a Hidden Markov Model would allow us 
to jointly learn the word distributions of each topic ($B$), the transition
probabilities between topics ($A$), as well as the initial probability of
each topic ($\pi$).


The Hidden Markov Model, HMM, is a probabilistic graphical model 
that describes the process of generating a sequence of observable 
events according to some hidden factors [20]. It simulates how the 
real world sequence data is generated from hidden states. Particularly, 
it consists of two stochastic processes: 1) invisible process, and 2) 
visible process [30]. In HMM, invisible process consists of hidden 
states whereas visible process is observed sequence of symbols that 
are drawn from the probability distributions of the hidden states. 
Figure 1 demonstrates the HMM model. As you can see from Figure
1, each observable event in the sequence is generated from a hidden
state and observations are conditionally independent given the hidden
state. You can also notice that the hidden states form a Markov chain
where each hidden state depends only on the previous state; for example,
$Z_{t+1}$ depends on $Z_t$.


To control the process of generating the observed sequences from
hidden states, HMM has three parameters: $\pi$, $A$, and $B$. The first
parameter, $\pi = (\pi_1, \pi_2, \ldots, \pi_M)$, is the initial probability distribution
over the hidden states. The parameter $\pi$ determines the probability
of the Markov chain starting at each state and hence controls which
state can be chosen as an initial state for the observed sequence.
The second parameter, $A \in \mathbb{R}^{M \times M}$, is the transition probability
matrix that specifies how likely the model is to transit from one state
to another, denoted by $P(Z_{t+1} \mid Z_t)$ in Figure 1. The third parameter,
$B \in \mathbb{R}^{M \times V}$, is the emission probability matrix, where $V$ is the total
number of distinct symbols. It determines the likelihood of each
state producing each symbol, denoted by $P(X_{t+1} \mid Z_{t+1})$ in Figure
1. For example, to generate a sentence, a sequence of words would
be drawn from the HMM model according to the three parameters
$\pi$, $A$, and $B$.
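

The generative process just described can be sketched as follows. The snippet draws a sequence of hidden states and observed symbols from a standard HMM given the initial distribution, the transition matrix, and the emission matrix. The function name `sample_hmm`, the fixed sequence length, and the toy parameters are assumptions used purely for illustration.

```python
import numpy as np


def sample_hmm(pi, A, B, length, seed=0):
    """Draw (states, symbols) of the given length from a standard HMM.

    pi: (M,) initial state distribution.
    A:  (M, M) transition matrix, A[i, j] = P(next state j | current state i).
    B:  (M, V) emission matrix, B[i, v] = P(symbol v | state i).
    """
    rng = np.random.default_rng(seed)
    num_states, num_symbols = B.shape
    states, symbols = [], []
    state = rng.choice(num_states, p=pi)            # pick the initial state
    for _ in range(length):
        states.append(int(state))
        symbols.append(int(rng.choice(num_symbols, p=B[state])))  # emit a symbol
        state = rng.choice(num_states, p=A[state])  # move to the next state
    return states, symbols


toy_pi = np.array([0.6, 0.4])
toy_A = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
toy_B = np.array([[0.5, 0.4, 0.1],
                  [0.1, 0.3, 0.6]])
print(sample_hmm(toy_pi, toy_A, toy_B, length=5))
```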


In MOOCs, we only observe courses, where courses are sequences of
lectures, while the topics of lectures and the transitions between them
are invisible or latent. Therefore, HMM is a suitable model to
simulate the generation process of courses and hence infer the latent
states that contribute to the evolution of these lectures. In HMM,
each hidden state generates only one symbol or word (see Figure 
1). As our goal is to capture topic transitions using sequences of 
lectures as observed data, we map each lecture to a topic and assume 
each hidden state generates a lecture instead of a word. Our revised 
HMM assumes that each hidden state produces one lecture where 
each lecture is a bag-of-words. We ignore the sequence of words 
in lectures since the order of the words would not contribute to 
capturing the topic of each lecture. Figure 2 depicts the HMMULM
utilized to capture the content of MOOCs. 


In order to capture both the lectures' topics and the transitions between
them, we combine the mixture model (MULM) with HMM,
and we call the new model Hidden Markov Mixture of Unigram
Language Model (HMMULM). To do that, we adopt the Markov
assumption between topics, where in the generation process,
the choice of the next topic depends only on the current topic. Even
though the choice of the topic in the course delivery depends on the
previous topics discussed so far, this simplified assumption makes
sense due to the locality of reference property [1] of course design.
Based on this property, when an instructor designs a course, a dependent
lecture should appear as soon as possible after the prerequisite
lecture to reduce students' comprehension burden. Therefore, assuming
the dependency between adjacent lectures not only simplifies
the model but also aids in capturing the transitions between highly
related topics. By combining the HMM with the mixture model, the
likelihood of generating a course is as follows:


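$$
\begin{aligned}
P(X \mid \lambda) &= \sum_{\text{all } Z} P(Z \mid \lambda)\, P(X \mid Z, \lambda) \\
&= \sum_{\text{all } Z} P(z_1)\, P(x_1 \mid z_1) \prod_{t=2}^{T} P(z_t \mid z_{t-1})\, P(x_t \mid z_t) \\
&= \sum_{\text{all } Z} P(z_1) \prod_{w \in V} P(w \mid z_1)^{c(w, x_1)} \prod_{t=2}^{T} P(z_t \mid z_{t-1}) \prod_{w \in V} P(w \mid z_t)^{c(w, x_t)} \\
&= \sum_{i=1}^{M} \sum_{j=1}^{M} \pi(z_1 = s_i) \prod_{w \in V} B(z_1 = s_i, w)^{c(w, x_1)} \prod_{t=2}^{T} A(z_{t-1} = s_i, z_t = s_j) \prod_{w \in V} B(z_t = s_j, w)^{c(w, x_t)}
\end{aligned}
\tag{5}
$$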


To estimate the HMMULM parameters $\lambda = (\pi, A, B)$, we use a
modified version of the Baum-Welch algorithm in order to model the
observation sequences as multidimensional categorical events.
Following the work in [19], we derived the equations of the E-step and
M-step to train the model and infer the transition probabilities between
topics. In the E-step, we use the equations:
$$\gamma_t(i) = P(z_t = s_i \mid X, \lambda) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{M} \alpha_t(j)\, \beta_t(j)} \tag{6}$$


$$\xi_t(i, j) = P(z_t = s_i, z_{t+1} = s_j \mid X, \lambda) = \frac{\alpha_t(i)\, A_{ij}\, \beta_{t+1}(j) \prod_{w \in V} B_j(w)^{c(w, x_{t+1})}}{\sum_{i=1}^{M} \sum_{j=1}^{M} \alpha_t(i)\, A_{ij}\, \beta_{t+1}(j) \prod_{w \in V} B_j(w)^{c(w, x_{t+1})}} \tag{7}$$


Figure 1: The graphical model of HMM. (Diagram: a sequence of hidden states over time, linked by the transition probability $P(Z_{t+1} \mid Z_t)$, each emitting one observation with the emission probability $P(X_{t+1} \mid Z_{t+1})$.)


Figure 2: The graphical model of HMMULM used to
model the content of MOOCs. (Diagram: a sequence of hidden states over time, each emitting one lecture $x_t$.)


In the M-step, the following equations are used to choose the pa- 
rameters that maximize the likelihood of the observed sequence of 
lectures: 


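$$\pi(i) = \gamma_1(i) \tag{8}$$


$$A_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \tag{9}$$


$$B_i(w) = \frac{\sum_{t=1}^{T} \gamma_t(i)\, c(w, x_t)}{\sum_{v \in V} \sum_{t=1}^{T} \gamma_t(i)\, c(v, x_t)} \tag{10}$$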


For more information about Baum-Welch algorithm, please see [20]. 
It is clear that the E-step and M-step equations are very similar 
to the standard HMM except that instead of emitting one symbol, 
HMMULM emits one lecture represented as a bag-of-words. 
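

As a hedged illustration of the E-step and M-step (Equations 6 to 10), the sketch below runs one scaled forward-backward pass over a single course whose lectures are rows of word counts, and then re-estimates the parameters. It is a simplified single-course version written for this explanation, not the authors' code: the function names, the rescaling of the emission terms, and the small `eps` smoothing constant are assumptions. Training on several courses would accumulate the same statistics over all of them before normalizing.

```python
import numpy as np


def e_step(X, pi, A, B):
    """Posteriors gamma (T, M) and xi (T-1, M, M) for one course.

    X: (T, V) word counts, one row per lecture.
    B: (M, V) word distributions with strictly positive entries.
    """
    X = np.asarray(X, dtype=float)
    T, M = X.shape[0], len(pi)
    # log of prod_w B_i(w)^{c(w, x_t)} for every lecture t and topic i.
    log_emis = X @ np.log(B).T
    # Rescale per lecture to avoid underflow; the constant cancels after normalization.
    emis = np.exp(log_emis - log_emis.max(axis=1, keepdims=True))
    alpha = np.zeros((T, M))
    beta = np.ones((T, M))
    alpha[0] = pi * emis[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):                      # forward pass
        alpha[t] = (alpha[t - 1] @ A) * emis[t]
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):             # backward pass
        beta[t] = A @ (emis[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = alpha[:-1, :, None] * A[None, :, :] * (emis[1:] * beta[1:])[:, None, :]
    xi /= xi.sum(axis=(1, 2), keepdims=True)
    return gamma, xi


def m_step(X, gamma, xi, eps=1e-6):
    """Re-estimate (pi, A, B) from the posteriors; eps is an added smoothing term."""
    X = np.asarray(X, dtype=float)
    pi = gamma[0]
    A = xi.sum(axis=0) + eps
    A /= A.sum(axis=1, keepdims=True)
    B = gamma.T @ X + eps
    B /= B.sum(axis=1, keepdims=True)
    return pi, A, B
```

The only departure from a textbook Baum-Welch pass is the emission term: each state scores a whole bag of words through the matrix product of counts and log-probabilities rather than looking up a single symbol.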


5.4 Structural Topic Model 

The Structural Topic Model (strTM) [25] is another probabilistic
graphical model that functions very similarly to HMMULM. It has
been used to model the latent topical structures inside documents.
Like HMMULM, it models topics and their transitions as hidden
states that emit lectures as bags-of-words. Unlike HMMULM, strTM
models each lecture as a mixture of content topics and a functional
topic. The functional topic, denoted by $z_B$, is used to filter out document-independent
words that model the corpus background (or general
terms) [31]. Each word in the lecture is generated either by one of
the content topics or by the functional topic:


$$w \sim \theta P(w \mid \beta, z) + (1 - \theta) P(w \mid \beta, z_B) \tag{11}$$


where $\theta$ is the controlling parameter. According to strTM, the
probability of lecture $x_j$ being generated by some topic $z_i$ is:


$$P(x_j \mid z_i) = \prod_{w \in V} \left(\theta P(w \mid \beta, z_i) + (1 - \theta) P(w \mid \beta, z_B)\right)^{c(w, x_j)} \tag{12}$$


Another difference between strTM and HMMULM is that strTM
assumes the transition probabilities $A$ and the emission probabilities
$B$ are drawn from Multinomial distributions and uses the conjugate
Dirichlet distribution to impose a prior on the Multinomial
distributions:


$$\alpha_z \sim \mathrm{Dir}(\eta) \tag{13}$$


$$\beta_z \sim \mathrm{Dir}(\gamma) \tag{14}$$


where $\eta$ and $\gamma$ are the concentration hyperparameters that control
the sparsity of $\alpha_z$ and $\beta_z$, respectively.


To estimate the parameters of strTM, we use the expectation-maximization
algorithm as described by [25]. For more information
about strTM, please refer to [25].


Table 1: The dataset utilized in the experiment.

Domain | # of Courses | # of Lectures | Avg # of Lectures
Python | 21           | 460           | 22
SQL    | 15           | 247           | 16
ML     | 10           | 99            | 10
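

The emission probability in Equation 12 can be sketched in Python as below. The snippet computes the log-probability of a lecture's word counts when every word is drawn from a mixture of one content topic and the functional (background) topic. The function name `strtm_log_emission` and the toy distributions are assumptions for illustration; the full strTM estimation procedure described in [25] is not reproduced here.

```python
import numpy as np


def strtm_log_emission(x, content_dist, background_dist, theta):
    """Log of Equation 12 for one lecture and one content topic.

    x:               (V,) word counts of the lecture.
    content_dist:    (V,) word distribution of the content topic.
    background_dist: (V,) word distribution of the functional topic.
    theta:           mixing weight of the content topic, between 0 and 1.
    """
    x = np.asarray(x, dtype=float)
    mixture = theta * np.asarray(content_dist) + (1.0 - theta) * np.asarray(background_dist)
    return float(x @ np.log(mixture))


lecture = np.array([2, 0, 1])
content = np.array([0.5, 0.3, 0.2])
background = np.array([0.2, 0.2, 0.6])
print(strtm_log_emission(lecture, content, background, theta=0.5))
```

Setting the mixing weight to 1 removes the background component, in which case the expression reduces to the bag-of-words emission used by HMMULM.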


6. EVALUATION 


In this section, we first describe our dataset and the parameter
settings. Second, we compare different models by studying the impact 
of topic transitions learned from various models on three lecture 
sequencing tasks. Finally, we qualitatively evaluate the topics and 
their transitions. 


6.1 Dataset and Parameters Settings 

We collected our dataset from real online courses using various
MOOC platforms and in three different domains: Python, Structured
Query Language (SQL), and Machine Learning Clustering algorithms
(ML). Table 1 presents the statistics of the dataset. We use
75% of the data as a training set and 25% as a test set. To choose the
number of topics in each domain, we manually inspected the dataset.
The number of topics for Python,
SQL, and ML was set to 13, 10, and 9, respectively.


Each course in the dataset is represented as a sequence of lecture video
transcripts. We preprocess lecture transcripts by eliminating stop
words and some rare terms. After cleaning the data, we constructed
the bag-of-words vector representations of all lectures. We only use
lecture transcripts to represent lectures; therefore, we only need to
set two thresholds ($K_1$ and $K_2$) of the PCK-Means method in order
to select the list of Must-Link and Cannot-Link constraints. Since
we do not have labeled data, we chose the thresholds that maximize
the Silhouette Coefficient clustering measure using the training data.
We set $K_1 = 0.55$ and $K_2 = 0.004$ for Python, $K_1 = 0.8$ and
$K_2 = 0.01$ for SQL, and $K_1 = 0.55$ and $K_2 = 0.01$ for ML. To
set the hyperparameters of the strTM method, we used a grid search
and chose the values that maximize the likelihood of the training
data. We set $\theta = 0.2$, $\gamma = 0.3$, and $\eta = 0.6$ for Python, $\theta = 0.1$,
$\gamma = 0.3$, and $\eta = 0.1$ for SQL, and $\theta = 0.1$, $\gamma = 0.1$, and $\eta = 0.6$
for ML.


Table 2: The performance of Task 1: Finding the correct sequence using the permutation method. It is clear that PCK-Means
achieves the highest performance.

Dataset | Measure        | Cosine | PCK-Means | MULM | HMMULM | strTM
Python  | Kendall's τ(σ) | 0.60   | 0.73      | 0.49 | 0.66   | 0.54
Python  | Dir-P          | 0.37   | 0.50      | 0.43 | 0.28   | 0.31
Python  | Undir-P        | 0.52   | 0.56      | 0.50 | 0.44   | 0.35
Python  | Lev-Sim        | 0.60   | 0.63      | 0.45 | 0.49   | 0.50
SQL     | Kendall's τ(σ) | 0.58   | 0.59      | 0.58 | 0.55   | 0.41
SQL     | Dir-P          | 0.46   | 0.43      | 0.40 | 0.36   | [illegible]
SQL     | Undir-P        | 0.67   | 0.58      | 0.49 | 0.44   | 0.41
SQL     | Lev-Sim        | 0.53   | 0.57      | 0.54 | 0.49   | 0.37
ML      | Kendall's τ(σ) | 0.68   | 0.75      | 0.63 | 0.58   | 0.64
ML      | Dir-P          | 0.34   | 0.34      | 0.44 | 0.34   | 0.43
ML      | Undir-P        | 0.52   | 0.52      | 0.57 | 0.41   | 0.53
ML      | Lev-Sim        | 0.61   | 0.65      | 0.55 | 0.43   | 0.53


6.2 Sequencing Tasks 

In this experiment, our goal is to compare the topic transitions 
modeled by different methods in three tasks: 1) Finding the correct 
sequence of lectures, 2) Predicting the next lecture given a sequence 
of lectures, and 3) Predicting the sequence of a list of lectures 
where the first lecture in the sequence is given. An example of a real
application for task 1 and task 3 is designing a new course plan by
sequencing lectures before delivering them to students. However,
task 1 and task 3 exploit two different techniques to find the sequence.
In contrast, task 2 can be applied to recommend the next lecture to 
learners to customize their learning based on the history of lectures 
they already watched. In the evaluation, the purpose of each task is to 
compare different methods and evaluate the ability of the parameters 
(A, B, and 7) of each model to find the correct sequence in the three 
different tasks. 


6.2.1 Evaluation Measures 

To compare different models, we use the sequences of lectures
from courses in the test set as the ground truth sequences and
apply several evaluation measures. First, we follow
Wang et al. [25] and use Kendall's τ(σ), an
information retrieval measure that captures the correlation between
two ranked lists. It indicates how the predicted order differs from
the ground truth, where 1 means a perfect match, -1 means a total
mismatch, and 0 indicates that the two orders are independent.
Second, we use the Levenshtein normalized similarity, which is the
complement of the Levenshtein normalized distance. The distance
measures the minimum number of edits (insertions, deletions, or
substitutions) required to transform the predicted sequence into the
ground truth sequence. The goal is to find a sequence whose Levenshtein
normalized similarity is close to 1, which indicates that the number
of edits required is minimal. Third, we use the directed bigram
precision (see equation 15), which captures the correctness of the
order between adjacent lectures. The intuition behind this measure is
to evaluate whether the transition maps learned by different models
capture the correct direction between topics and adjacent lectures.
Finally, we use the undirected bigram precision (see equation 16) to
measure whether the transition map of each model recognizes adjacent
lectures even when it captures the direction between them incorrectly.


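\[
P_{\text{Dir-bigram}} = \frac{\#\ \text{of correct}\ (a \rightarrow b)\ \text{in estimated sequence}}{\#\ \text{of correct}\ (a \rightarrow b)\ \text{in ground truth}} \tag{15}
\]

\[
P_{\text{Undir-bigram}} = \frac{\#\ \text{of correct}\ \{a, b\}\ \text{in estimated sequence}}{\#\ \text{of correct}\ \{a, b\}\ \text{in ground truth}} \tag{16}
\]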
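
As an illustration only, the four measures above could be computed as in the following sketch. The function names are hypothetical, lectures are assumed to be distinct hashable identifiers, and the Levenshtein distance is assumed to be normalized by the length of the longer sequence; the paper does not state which normalization it uses.

```python
from itertools import combinations


def kendall_tau(predicted, truth):
    """Kendall's tau between two orderings of the same distinct lectures."""
    position = {lecture: i for i, lecture in enumerate(predicted)}
    n = len(truth)
    if n < 2:
        return 1.0
    concordant = discordant = 0
    # Every pair (a, b) drawn from `truth` is in the correct order; the pair
    # is concordant when the prediction keeps that relative order.
    for a, b in combinations(truth, 2):
        if position[a] < position[b]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def levenshtein_similarity(predicted, truth):
    """1 - edit distance / max length (normalization is an assumption)."""
    m, n = len(predicted), len(truth)
    if max(m, n) == 0:
        return 1.0
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if predicted[i - 1] == truth[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return 1.0 - prev[n] / max(m, n)


def bigram_precision(predicted, truth, directed=True):
    """Share of ground-truth adjacent pairs that also appear in the prediction."""
    def bigrams(seq):
        pairs = zip(seq, seq[1:])
        return {(a, b) if directed else frozenset((a, b)) for a, b in pairs}

    truth_bigrams = bigrams(truth)
    if not truth_bigrams:
        return 0.0
    return len(bigrams(predicted) & truth_bigrams) / len(truth_bigrams)
```

For the ground truth [1, 2, 3, 4] and the prediction [1, 2, 4, 3], this sketch gives a Kendall's τ of 2/3, a Levenshtein similarity of 0.5, a directed bigram precision of 1/3, and an undirected bigram precision of 2/3. The gap between the last two values shows how the undirected measure credits the adjacent pair {3, 4} even though its direction is wrong.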


6.2.2 Task 1: Finding The Correct Sequence 

To find the correct sequence of lectures, we follow the permutation
method used by [25]. For courses with a large number of
lectures, it is infeasible to enumerate all orderings of the lectures.
Therefore, when the number of permutations exceeds 500, we randomly
sample 500 orderings of the lectures as candidates. We ran
the experiment 20 times for each method and recorded the average
results.
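
The candidate generation step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the `seed` parameter are hypothetical, lectures are assumed to be distinct hashable identifiers, and the sampled orderings are assumed to be distinct, which the text does not specify.

```python
import itertools
import math
import random


def candidate_orderings(lectures, limit=500, seed=None):
    """All orderings when there are at most `limit`, else `limit` random ones."""
    lectures = list(lectures)
    if math.factorial(len(lectures)) <= limit:
        return [list(p) for p in itertools.permutations(lectures)]

    rng = random.Random(seed)
    seen = set()
    candidates = []
    # More than `limit` distinct permutations exist here, so the loop ends.
    while len(candidates) < limit:
        order = tuple(rng.sample(lectures, len(lectures)))
        if order not in seen:  # keep sampled orderings distinct (assumption)
            seen.add(order)
            candidates.append(list(order))
    return candidates
```

With this threshold, courses of up to five lectures (5! = 120) are enumerated exhaustively, while courses of six or more lectures (6! = 720) are sampled.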


In order to select the optimal sequence from the list of permutations 
in strTM and HMMULM, we follow Wang et al. [25] and choose the 
sequence that has the highest generation probability calculated as: 


\[
\hat{\sigma}(m) = \arg\max_{\sigma(m)} \sum_{Z} P(x_{\sigma[0]}, x_{\sigma[1]}, \ldots, x_{\sigma[m]}, Z \mid \lambda) \tag{17}
\]
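
Summing the joint probability over all hidden topic sequences Z is what the standard HMM forward algorithm computes, so one way to score a candidate ordering is sketched below. The data layout is an assumption made for illustration: `lecture_logp[x][k]` holds the log-likelihood of lecture x under topic k, and `log_pi` and `log_A` hold the log initial and log transition probabilities. None of these names come from the paper.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    if top == float("-inf"):
        return top
    return top + math.log(sum(math.exp(v - top) for v in values))


def sequence_log_likelihood(emission_logp, log_pi, log_A):
    """Forward algorithm: log of the sum over Z of P(x_1, ..., x_m, Z).

    emission_logp[t][k]: log P(x_t | topic k) for the t-th lecture in the order
    log_pi[k]:           log initial probability of topic k
    log_A[j][k]:         log transition probability from topic j to topic k
    """
    K = len(log_pi)
    alpha = [log_pi[k] + emission_logp[0][k] for k in range(K)]
    for t in range(1, len(emission_logp)):
        alpha = [
            emission_logp[t][k]
            + log_sum_exp([alpha[j] + log_A[j][k] for j in range(K)])
            for k in range(K)
        ]
    return log_sum_exp(alpha)


def best_ordering(candidates, lecture_logp, log_pi, log_A):
    """Candidate ordering with the highest generation probability."""
    return max(
        candidates,
        key=lambda order: sequence_log_likelihood(
            [lecture_logp[x] for x in order], log_pi, log_A
        ),
    )
```

Working in log space avoids underflow when a course has many lectures, since the product of many small probabilities quickly falls below floating-point range.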


Figure 3: The performance of different methods in Task 3: Predicting the whole sequence. All methods have comparable performance.


To choose the best sequence for MULM, we first find the best topic
c that generates each lecture in the test set according to equation
4. After that, we select the sequence that has the highest likelihood
based on the equation:

\[
\hat{\sigma}(m) = \arg\max_{\sigma(m)} P(C)\, P(X \mid C)
= \arg\max_{\sigma(m)} P(c_1)\, P(x_1 \mid c_1) \prod_{i=2}^{|\sigma(m)|} P(c_i \mid c_{i-1})\, P(x_i \mid c_i) \tag{18}
\]
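
A minimal sketch of this score in log space is given below. The names are hypothetical: `topic_of[x]` is the topic already assigned to lecture x, `lecture_logp[x][c]` is log P(x | c), and `log_prior` and `log_trans` are the log topic prior and log topic transition probabilities.

```python
def mulm_sequence_score(order, topic_of, lecture_logp, log_prior, log_trans):
    """Log of P(c_1) P(x_1|c_1) * prod_i P(c_i|c_{i-1}) P(x_i|c_i) for one ordering."""
    first = order[0]
    c_prev = topic_of[first]
    score = log_prior[c_prev] + lecture_logp[first][c_prev]
    for x in order[1:]:
        c = topic_of[x]
        score += log_trans[c_prev][c] + lecture_logp[x][c]
        c_prev = c
    return score


def best_mulm_ordering(candidates, topic_of, lecture_logp, log_prior, log_trans):
    """Candidate ordering with the highest score."""
    return max(
        candidates,
        key=lambda order: mulm_sequence_score(
            order, topic_of, lecture_logp, log_prior, log_trans
        ),
    )
```

If each lecture keeps the same assigned topic regardless of its position, the emission terms add up to the same constant for every permutation of the same lectures. Under that assumption, the ranking of candidates is driven only by the prior of the first topic and the topic transitions.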


Since PCK-Means is a clustering method that minimizes the distance
between lectures and cluster centroids, we assign each lecture x of the
test set to the closest cluster z_i using the Euclidean distance, as
shown in equation 19. Then, we select the sequence that maximizes the
topic transitions between lectures in the sequence and minimizes the
distance between adjacent lectures (see equation 20). The intuition
is to ensure topic coherence between adjacent lectures
and to reduce gaps by minimizing the distance between them.

\[
c(x) = \arg\min_{z_i} \lVert x - \mu_{z_i} \rVert \tag{19}
\]

\[
\hat{\sigma}(m) = \arg\max_{\sigma(m)} \pi(c(x_1)) \sum_{i=2}^{|\sigma(m)|} A(x_{i-1}, x_i) - \lVert x_i - x_{i-1} \rVert^2 \tag{20}
\]
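
One possible reading of equations 19 and 20 is sketched below. It rests on several assumptions that the text does not confirm: the initial-topic term, the transition terms, and the negative squared distances are simply added together, and the transition score A is looked up using the clusters of the two adjacent lectures. All function and parameter names are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_cluster(x, centroids):
    """Equation 19: index of the centroid closest to lecture vector x.

    Minimizing the squared distance selects the same centroid as
    minimizing the Euclidean distance.
    """
    return min(range(len(centroids)),
               key=lambda k: squared_distance(x, centroids[k]))


def pck_sequence_score(order, vectors, centroids, pi, A):
    """Additive reading of equation 20 for one candidate ordering (assumption).

    vectors[x]: feature vector of lecture x
    pi[k]:      initial score of cluster k
    A[j][k]:    transition score from cluster j to cluster k
    """
    clusters = [nearest_cluster(vectors[x], centroids) for x in order]
    score = pi[clusters[0]]
    for i in range(1, len(order)):
        score += A[clusters[i - 1]][clusters[i]]
        score -= squared_distance(vectors[order[i]], vectors[order[i - 1]])
    return score
```

Because the transition scores and the distances live on different scales, the distance term can dominate the score when lecture vectors are not normalized; whether and how the original work balances the two terms is not stated here.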


As a baseline, we sum the cosine similarity between adjacent
lectures in each candidate sequence and select the permutation
with the highest total similarity as the optimal sequence.
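
The cosine baseline can be sketched as follows, assuming each lecture is represented by a numeric vector (for example, term weights). The names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cosine_sequence_score(order, vectors):
    """Sum of cosine similarities between adjacent lectures in the ordering."""
    return sum(cosine(vectors[a], vectors[b]) for a, b in zip(order, order[1:]))


def best_by_cosine(candidates, vectors):
    """Candidate ordering with the highest accumulated similarity."""
    return max(candidates, key=lambda order: cosine_sequence_score(order, vectors))
```

Since cosine similarity is symmetric, an ordering and its exact reverse always receive the same score under this baseline, so it cannot by itself tell the direction of a sequence.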


Table 2 summarizes the results of Task 1 for each method.
PCK-Means has the highest score in Kendall's τ(σ) and
Levenshtein normalized similarity on all datasets, which indicates that
PCK-Means chose sequences that are highly correlated with the
ground truth sequences and need the fewest edits to be transformed
into them. However, PCK-Means outperforms the
other models in directed and undirected bigram precision only on the
Python dataset, indicating that it is sometimes unable to capture the
order of adjacent lectures.


Table 3: The performance of Task 2: Predicting the next lecture. It is clear that HMMULM achieves the highest performance.

                     Accuracy
Method               Python  SQL   ML
Cosine-Similarity    0.46    0.56  0.42
PCK-Means            0.45    0.49  0.47
MULM                 0.41    0.34  0.37
HMMULM (Viterbi)     0.52    0.56  0.60
strTM (Viterbi)      0.39    0.27  0.43


In general, PCK-Means achieves the highest performance
in most measures and on almost all datasets. We believe that
combining the topic transitions with the Euclidean distance helps
PCK-Means find the best sequence in the list of candidate
sequences.


6.2.3 Task 2: Predicting The Next Lecture 

In task 2, each model predicts the next lecture given a sequence of
lectures. We varied the length of the given sequence, starting from
one. As strTM and HMMULM are based on HMMs, we used the
Viterbi algorithm [20] to find the most probable sequence of hidden
states (topics) that generated the lectures in the given sequence.
Then we greedily choose the most probable next lecture in the sequence
according to the equation:

\[
\hat{x} = \arg\max_{x} P(z_i \mid z_{i-1})\, P(x \mid z_i) \tag{21}
\]


[Figure 4: two columns of panels, each listing a predicted lecture order together with its predicted and actual topic (or state) assignments. (a) MULM, HMMULM, and PCK-Means on the "python quick start" course; (b) strTM, HMMULM, and PCK-Means on the "columbia machine learning" course.]

Figure 4: Qualitative Analysis of Sequencing Task 3. (a) Examples of preferring self transition behaviour when selecting next lecture
in the sequence, (b) Examples of the problem of sequencing adjacent lectures that cover the same topics.


Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) 145


Similar to task 1, for the MULM and PCK-Means models we assign
lectures to the best clusters using equations 4 and 19, respectively.
After that, MULM greedily chooses the next lecture that maximizes
equation 21. PCK-Means, on the other hand, selects the
next lecture that maximizes the topic transition and minimizes the
distance to the last lecture in the given sequence. For the baseline,
we use the cosine similarity and choose the next lecture that
has the highest similarity score with the last lecture in the given
sequence.
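
For the HMM-based models, the two steps described for this task (Viterbi decoding of the watched lectures, then a greedy choice of the next lecture) could look like the sketch below. It assumes that the greedy step maximizes over both the next topic and the candidate lecture, which is one interpretation of equation 21. The data layout (`lecture_logp[x][k]` as log P(x | topic k), `log_pi`, `log_A`) and all names are hypothetical.

```python
def viterbi(emission_logp, log_pi, log_A):
    """Most probable topic path for a sequence of lectures (log space)."""
    K = len(log_pi)
    delta = [log_pi[k] + emission_logp[0][k] for k in range(K)]
    backpointers = []
    for t in range(1, len(emission_logp)):
        step, new_delta = [], []
        for k in range(K):
            # Best previous topic for landing in topic k at position t.
            j_best = max(range(K), key=lambda j: delta[j] + log_A[j][k])
            step.append(j_best)
            new_delta.append(delta[j_best] + log_A[j_best][k] + emission_logp[t][k])
        backpointers.append(step)
        delta = new_delta
    state = max(range(K), key=lambda k: delta[k])
    path = [state]
    for step in reversed(backpointers):
        state = step[state]
        path.append(state)
    return path[::-1]


def predict_next(history, candidates, lecture_logp, log_pi, log_A):
    """Greedy next-lecture choice given the lectures already watched."""
    path = viterbi([lecture_logp[x] for x in history], log_pi, log_A)
    z_prev = path[-1]
    K = len(log_pi)

    def score(x):
        # max over next topic z of log P(z | z_prev) + log P(x | z)
        return max(log_A[z_prev][z] + lecture_logp[x][z] for z in range(K))

    return max(candidates, key=score)
```

Only the last decoded topic influences the greedy step, so errors in decoding the final state of the history propagate directly to the prediction.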


Table 3 summarizes the results of Task 2 for each method.
HMMULM achieves the highest accuracy on all
datasets, although the cosine baseline ties with it on SQL (0.56).
Using the Viterbi algorithm along with the learned topic
transitions helps capture the most probable hidden states (topics)
that generate the given sequence of lectures. In addition,
the topic transitions learned by HMMULM help in greedily picking
the next lecture in the sequence. Although strTM also uses the Viterbi
algorithm, its accuracy is far lower
than that of HMMULM. We think the main reason is the
quality of its learned topic transitions, as we explain in section
6.3.


6.2.4 Task 3: Predicting The Sequence 


Task 3 is very similar to task 2 except that each method needs to find 
the whole sequence of given lectures where the first lecture in the 
sequence is given. Figure 3 depicts the results of Task 3. 


As this task is the most challenging one,
there is no winning method. However, the upper left graph in Figure 3,
which shows Kendall's τ, indicates that HMMULM
achieved a τ score > 0.50 in four courses, PCK-Means
achieved the same score in only two courses, MULM and strTM
in only one course, and the Cosine method in no course. For the
Levenshtein normalized similarity, all methods have
comparable results. For the directed and undirected bigram precision,
all methods also have comparable results except MULM. The reason
is that MULM sometimes cannot complete the whole sequence:
it relies only on the greedy method, which fails when the topic
transitions required to sequence the courses in the test set are
absent. The other methods
always find the whole sequence, either because of the Viterbi
algorithm used by HMMULM and strTM or because of the similarity or
distance measures used by the PCK-Means and Cosine methods.


In addition to quantitatively comparing the methods, we
qualitatively evaluate the results by examining the sequences
generated by each method. In general, we found two common behaviours
shared by all methods.


First, in most cases almost all the methods prefer self transition when 
they pick the next lecture in the sequence. For example, as shown in 
Figure 4 (a), MULM, HMMULM, and PCK-Means select the next 
lecture that has the same topic as the current lecture. 


Figure 5: Python topics and their transitions using the PCK-Means method. The table represents the top terms in each topic.


Second, none of the methods can correctly sequence lectures that belong
to the same topic. In MOOCs, due to the short length of lectures,
instructors sometimes explain the same topic using multiple lectures.
As a result, it is hard to find the correct sequence of lectures that
cover the same topic. For example, as shown in Figure 4 (b), the last
four lectures of the course explain the “Principal Component Analysis
algorithm” and hence strTM, HMMULM, and PCK-Means cannot predict the
correct sequence of these lectures. In this case, we need to use other
techniques to predict the sequence. One naive solution is to treat
all adjacent lectures that belong to the same topic as one atomic
unit, so that we only need to sequence lectures that have different
topics.


Further investigation of the problem of sequencing lectures
that belong to the same topic is left for future work.
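
The naive grouping idea described above (merging adjacent lectures that share a topic into one atomic unit) can be sketched as follows. This is an illustration only; `group_atomic_units` and `topic_of` are hypothetical names, and `topic_of` is assumed to map each lecture to its assigned topic.

```python
from itertools import groupby


def group_atomic_units(lectures, topic_of):
    """Merge runs of adjacent lectures with the same topic into one unit.

    Returns a list of (topic, [lectures in the run]) pairs, in order.
    """
    return [
        (topic, list(run))
        for topic, run in groupby(lectures, key=lambda x: topic_of[x])
    ]
```

Sequencing would then operate on the units rather than on individual lectures, leaving the order inside each unit unresolved. Note that a topic that recurs later in the course forms a separate unit, because only adjacent lectures are merged.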


6.3 Topic Transitions Examples 

In this section, we present examples of topics and topic transitions
learned by the different methods. Due to space constraints, we present
examples using the Python dataset only. We analyzed the words with the
highest probabilities in the word distributions of the topics learned
by each method and manually mapped them to topic words or phrases.
For instance, if a word distribution has the words list, range,
items, index, and append, then it clearly
captures the topic “List”. The word distributions and topic phrases
of the topics learned by the PCK-Means, MULM, HMMULM, and
strTM methods on the Python dataset are depicted in Figures 5, 6, 7,
and 8, respectively. Since we have 13 topics in the Python dataset, we
only visualize the topic transitions for a subset of these topics and
depict only the transitions that have scores > 0.05.


It is clear from the Figures that all models extract some useful topics
whose top terms clearly explain the topic. However,
PCK-Means has the best word distributions, followed by HMMULM and
then MULM, while strTM has the
lowest performance. We also notice from the Figures that PCK-Means
has extracted 11 useful topics, with two topics that have unclear
word distributions and cannot be mapped to any useful topic. In
contrast, MULM has modeled 10 meaningful topics, with three topics
that form noise and cannot be mapped to any topic. HMMULM and strTM,
on the other hand, capture 9 topics each, with four unclear topics that
cannot be mapped to any phrase. In general, this finding indicates
that PCK-Means has the best performance in modeling the topics
of the courses in the Python dataset, as it models more useful topics
with clear word distributions. The results also indicate that strTM
achieves the lowest performance: even though it captures the
same number of meaningful topics as HMMULM, its word distributions
are the least clear.


Figure 6: Python topics and their transitions using MULM method. The table represents the top terms in each topic.

Figure 7: Python topics and their transitions using HMMULM method. The table represents the top terms in each topic.

Figure 8: Python topics and their transitions using strTM method. The table represents the top terms in each topic.


As shown in Figures 5, 6, 7, and 8, the meaningful topics extracted
by all methods are very similar, with some variations. For example,
while PCK-Means and MULM separate “Files” and “CSV Files”,
HMMULM and strTM combine them into one topic. In addition,
while PCK-Means and HMMULM combine loops and conditions into a single
“Loops & Condition” topic, MULM differentiates between them. strTM,
on the other hand, has both “Loops & Condition” and “Conditional
Statements.”


Regarding the topic transitions, all models capture self
transitions from a topic to itself. This indicates that, in MOOCs,
instructors use multiple lectures to explain the same topic. However,
HMMULM gives a higher probability to self transitions than the
other methods. The Figures also show some
consensus among all methods on some transitions between topics,
such as “List” → “Dictionary” and “String” → “File.” There are also
some variations in topic transitions between models. For
instance, while PCK-Means, HMMULM, and strTM have a transition
“String” → “List”, MULM combines these two topics into
one topic or cluster. Another variation is that PCK-Means, strTM, and
MULM have a transition “Loop & Condition” → “String”, whereas
HMMULM misses this transition.


In general, all methods capture useful topics with clear word
distributions. Regarding the topic transitions, all methods capture self
transitions and also agree on some transitions. There
are also some variations between methods, and these differences are due
to how each method identifies the topic of each lecture. Improving the
modeling of topics and the mapping between lectures and topics
would clearly improve the quality of the topic transition maps.


7. CONCLUSION 


In this paper, we introduce the Topic Transition Map, a
general structure that models the content of MOOCs as topics, where
each lecture is mapped to a topic, and captures the transitions between
topics. It models the various ways in which instructors organize topics
to construct the study plan of their courses. We investigate
four different methods to construct the Topic Transition Map: PCK-Means,
MULM, HMMULM, and strTM. PCK-Means and MULM
separately cluster lectures into topics and then learn the transitions
between topics by leveraging the sequences of lectures in different
courses. In contrast, HMMULM and strTM assume a first-order Markov
property among latent topics and hence jointly learn topics and their
transitions. While the three models MULM, HMMULM, and strTM
are probabilistic, PCK-Means is a distance-based clustering
algorithm that incorporates constraints to guide the clustering
process.


We evaluated the topic transitions generated by the various methods
using three different tasks: 1) determining the correct sequence,
2) predicting the next lecture, and 3) predicting the sequence of
lectures. Our evaluation revealed that PCK-Means achieves the
highest performance in determining the correct sequence, while
HMMULM outperforms other methods in the task of predicting
the next lecture. Since predicting the whole sequence
of lectures is the most challenging task, there was no
winning method and all methods have comparable performance, with
MULM having the lowest performance as it sometimes fails to predict
the whole sequence. We also visualized the Topic Transition Maps
generated by the different methods to qualitatively evaluate the
resulting maps. We found that PCK-Means extracted more meaningful
topics with the best word distributions that clearly explain each topic.


In the future, we plan to explore combining the Topic Transition
Map with concept dependency relations and examine whether this can
solve the problem of sequencing lectures that belong to the same
topic. Further, we aim to combine different methods, such as PCK-Means
and HMMULM, in order to improve the accuracy of the
Topic Transition Map and hence improve the performance on
the sequencing tasks. Finally, we plan to apply our work to other
domains such as traditional university courses or educational books.
To do that, we need to investigate how to divide long lectures or book
sections into segments where each segment is mapped to one topic.